Skip to content

ClickHouse version of import users script#1317

Open
jamesqo wants to merge 7 commits into
masterfrom
import-users-ch
Open

ClickHouse version of import users script#1317
jamesqo wants to merge 7 commits into
masterfrom
import-users-ch

Conversation

@jamesqo

@jamesqo jamesqo commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Summary of changes

  • Removed all read/write interactions with MySQL
  • Moved all read operations with MySQL -> ClickHouse
  • Replace usage of clickhouse CLI with clickhouse-connect library (will be useful in the future when this script is moved to the Airflow server)
  • Update users / authorities table in a single batched SQL transaction
  • Removed --ssl-ca flag which no longer applies (this was only relevant for RDS)

The portal.properties files also need to be updated in order to include clickhouse creds -- will submit a PR to pipelines-configuration later

cc @averyniceday @sheridancbio

@jamesqo

jamesqo commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

still need to test changes here

@jamesqo

jamesqo commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

Did some testing. Further changes:

  • Converted script to Python3
  • Added more logging
  • Speed it up by doing a batched query for authorities in 1 swoop

added_user_rows.append([user_email, user_name, user.enabled])
added_authority_rows += [[user_email, authority] for authority in user.authorities]
if added_user_rows:
ch_client.insert('users', added_user_rows, column_names=['email', 'name', 'enabled'])

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add error checking like we were doing before for SQL here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes .. if the insert fails I wonder what happens. I suppose ch_client will throw an exception? At the very least we should confirm that if this happens we will get a stacktrace or some kind of indication in the run logs. A notice to our slack channel (letting us know that we failed an attempt to add a user to the user table) should probably also be made.

@jamesqo jamesqo Apr 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added back code that will email the failure message to users if there is an exception thrown during inserts.

Comment thread import-scripts/importUsers.py

@averyniceday averyniceday left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General changes look good - couple suggestions /questions regarding readiability and organization

Comment thread import-scripts/importUsers.py
Comment thread import-scripts/importUsers.py
Comment thread import-scripts/importUsersClickhouse.py

@sheridancbio sheridancbio left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I breezed over the parts which seemed to be clear language conversions to accommodate python 3 syntax. So long as this works in production I approve. I suppose the inherent retry is still present because each time this runs we re-compute the list of users who are missing from the database and try to import them again.

Comment thread import-scripts/import-portal-users.sh
Comment thread import-scripts/importUsers.py
added_user_rows.append([user_email, user_name, user.enabled])
added_authority_rows += [[user_email, authority] for authority in user.authorities]
if added_user_rows:
ch_client.insert('users', added_user_rows, column_names=['email', 'name', 'enabled'])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes .. if the insert fails I wonder what happens. I suppose ch_client will throw an exception? At the very least we should confirm that if this happens we will get a stacktrace or some kind of indication in the run logs. A notice to our slack channel (letting us know that we failed an attempt to add a user to the user table) should probably also be made.

Comment thread import-scripts/importUsers.py
@jamesqo

jamesqo commented Apr 9, 2026

Copy link
Copy Markdown
Contributor Author

@sheridancbio I'm wary of adding code for Slack notifications -- if something goes wrong with the network for example, we will be getting pinged every minute by this script. I did improve the error handling for failed DB operations however

@jamesqo

jamesqo commented Apr 14, 2026

Copy link
Copy Markdown
Contributor Author

@copilot resolve the merge conflicts in this pull request

@jamesqo

jamesqo commented Apr 14, 2026

Copy link
Copy Markdown
Contributor Author

We decided that we will be leaving the behavior on failure as-is for now but we should probably address it in the future so that users don't get spammed with emails if something goes wrong with our networking, for example.

SMTP_SERVER=`cat $MAIL_SMTP_SERVER`
GENIE_BLUE_DATABASE_PROPERTIES_FILENAME="portal.properties.genie.blue"
GENIE_GREEN_DATABASE_PROPERTIES_FILENAME="portal.properties.genie.green"
GENIE_BLUE_CLICKHOUSE_CLIENT_CONFIG_FILEPATH="/data/portal-cron/pipelines-credentials/clickhouse_client_genie_blue_config.yaml"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed these lines as we're no longer using the clickhouse CLI

@averyniceday averyniceday left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me - only one change -- let's just replace the contents of the existing importUsers.py script instead of having a separate importUsersClickhouse.py script. That way we do cleanup as part of this rollout, thanks!

@sheridancbio

sheridancbio commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Overall, I'm good with merging this PR. But one thing I'm not sure I understand ... why did we switch from using the command line executable "clickhouse client" and switch to using the python third party library "clickhouse_connect"?

I'm not against this ... but it means that now we have some parts of our code using clickhouse client and some parts using "clickhouse_connect". Both libraries come from the same organization ... but having 2 dependencies is worse than having 1. Perhaps we should stop using clickhouse client and do all clickhouse interactions through clickhouse_connect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants